Generative Exploration and Exploitation
Sparse reward is one of the biggest challenges in reinforcement learning
(RL). In this paper, we propose a novel method called Generative Exploration
and Exploitation (GENE) to overcome sparse reward. GENE automatically generates
start states to encourage the agent to explore the environment and to exploit
received reward signals. GENE can adaptively trade off between exploration and
exploitation according to the varying distributions of states experienced by
the agent as learning progresses. GENE relies on no prior knowledge about the
environment and can be combined with any RL algorithm, whether on-policy or
off-policy, single-agent or multi-agent. Empirically, we demonstrate that GENE
significantly outperforms existing methods in three tasks with only binary
rewards, including Maze, Maze Ant, and Cooperative Navigation. Ablation studies
verify the emergence of progressive exploration and automatic reversing.
Comment: AAAI'2
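The abstract does not specify how start states are generated, so the following is only a minimal Python sketch of one plausible reading: keep a buffer of visited states, estimate visitation density with a kernel, and sample start states that favor rarely visited regions while that density is low. All names here (StartStateGenerator, bandwidth, explore_weight) are hypothetical and not from the paper.

import numpy as np

class StartStateGenerator:
    """Hypothetical sketch of GENE-style start-state generation
    (illustrative only, not the authors' implementation)."""

    def __init__(self, bandwidth=0.5, explore_weight=1.0):
        self.states = []                      # buffer of visited states
        self.bandwidth = bandwidth            # kernel width for density estimate
        self.explore_weight = explore_weight  # >0 favors novel (low-density) states

    def observe(self, state):
        self.states.append(np.asarray(state, dtype=float))

    def _density(self, candidates, reference):
        # Gaussian-kernel estimate of how densely each candidate region
        # has been visited, based on a reference sample of past states.
        dists = np.linalg.norm(candidates[:, None, :] - reference[None, :, :], axis=-1)
        return np.exp(-(dists / self.bandwidth) ** 2).mean(axis=1)

    def generate(self, n_candidates=256):
        # Perturb past states to propose candidate start states, then sample
        # one with probability decreasing in visitation density: rarely visited
        # regions are preferred (exploration); as their density grows, sampling
        # naturally shifts back toward familiar, exploitable regions.
        states = np.stack(self.states)
        picks = states[np.random.randint(len(states), size=n_candidates)]
        candidates = picks + np.random.normal(scale=self.bandwidth, size=picks.shape)
        reference = states[np.random.randint(len(states), size=min(len(states), 512))]
        logits = -self.explore_weight * self._density(candidates, reference)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return candidates[np.random.choice(n_candidates, p=probs)]

In use, one would call observe(state) on each visited state and reset the environment to generate() at the start of each episode; the density-weighted sampling is what makes the exploration/exploitation balance adapt as the state distribution changes.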
Decentralized Policy Optimization
The study of decentralized learning or independent learning in cooperative
multi-agent reinforcement learning has a history of decades. Recent empirical
studies show that independent PPO (IPPO) can obtain good performance, close to
or even better than that of centralized training with decentralized execution
methods, in several benchmarks. However, a decentralized actor-critic algorithm
with a convergence guarantee remains an open problem. In this paper, we propose
\textit{decentralized policy optimization} (DPO), a decentralized actor-critic
algorithm with guaranteed monotonic improvement and convergence. We derive a
novel decentralized surrogate for policy optimization such that monotonic
improvement of the joint policy is guaranteed when each agent
\textit{independently} optimizes the surrogate. In practice, this decentralized
surrogate can be realized by two adaptive coefficients for policy optimization
at each agent. Empirically, we compare DPO with IPPO in a variety of
cooperative multi-agent tasks, covering discrete and continuous action spaces,
and fully and partially observable environments. The results show that DPO
outperforms IPPO in most tasks, providing empirical support for our theoretical
results.
Comment: 14 page
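The abstract states only that the surrogate is realized by two adaptive coefficients per agent, without giving its form. As an illustration, here is a hedged PyTorch sketch of one plausible per-agent loss: a standard ratio-weighted advantage term minus two penalties on policy change. The penalty forms and the names beta1 and beta2 are assumptions, not the paper's definitions.

import torch

def dpo_style_surrogate(new_logp, old_logp, advantage, beta1, beta2):
    # Hypothetical per-agent surrogate (illustrative; not the paper's exact loss).
    # ratio * advantage is the usual policy-improvement term; the two adaptive
    # coefficients penalize how far the new policy moves from the old one, which
    # is the kind of term that would let each agent optimize independently while
    # the joint policy still improves monotonically.
    ratio = torch.exp(new_logp - old_logp)   # pi_new(a|s) / pi_old(a|s)
    kl_est = old_logp - new_logp             # single-sample KL(old||new) estimate
    improvement = ratio * advantage
    penalty = beta1 * torch.abs(ratio - 1.0) + beta2 * kl_est
    return (improvement - penalty).mean()    # each agent maximizes this

In the actual algorithm the two coefficients would be adapted during training; here they are fixed inputs purely to keep the sketch self-contained.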